Combining Web Data Extraction and Data Mining Techniques to Discover Knowledge
The design and implementation of a support system for knowledge discovery is a challenge for many researchers. Because data mining is the key step in the Knowledge Discovery in Databases (KDD) process, a methodology is needed that combines web data extraction, which collects data from the web, with data mining techniques applied to the extracted categorical data in order to discover knowledge. The main contribution of this research is a methodology that applies clustering to categorical web data and uses the clustering results as part of the input for a classification conducted on another set of data. Data mining and the related data processing are carried out by developing intelligent tools. The performance of the algorithms used in our methodology is demonstrated on a clustered job postings dataset and a classified job searchers dataset, using three measures (accuracy, recall and precision) for the clustering algorithm and the classification error for the classification technique. The results show that the proposed combined approach gives good results in knowledge discovery from the web.
Web data extraction is a process that involves developing mechanisms that allow a user, who is not necessarily a specialist, to retrieve and automatically extract structured information from unstructured or semi-structured web data sources. Extracting structured data objects with several techniques makes it possible to integrate data and information from multiple Web pages into a single database. Web data extraction tools use wrappers to extract data from the input. These wrappers tokenize the input string prior to applying the extraction rules for each attribute. They then assemble the extracted values into records and repeat the process for all object instances in the input. With the huge amount of data available on the Web, web data extraction is widely combined with data mining techniques to extract and analyze online shopping data, prices, and data from e-commerce, financial and job recommendation sites. As an example of related work, Nahm and Mooney describe a system called DISCOTEX (DISCOvery from Text EXtraction) that combines information extraction and data mining techniques on job postings extracted from a corpus of computer job announcement texts.
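The wrapper steps described above (tokenize the input, apply one extraction rule per attribute, assemble records, repeat for every object instance) can be illustrated with a minimal Python sketch. It is not the extraction tool used in this work: the field names, the regular expressions and the assumption that postings are separated by blank lines are all hypothetical choices made for the example.

```python
import re

# Hypothetical extraction rules: one regular expression per attribute.
# The field names and patterns are illustrative, not taken from the paper.
EXTRACTION_RULES = {
    "title": re.compile(r"Title:\s*(.+)"),
    "salary": re.compile(r"Salary:\s*(.+)"),
    "experience": re.compile(r"Experience:\s*(.+)"),
}


def tokenize(page_text):
    """Split the input into one list of lines (tokens) per object instance."""
    # Assumption: postings are separated by blank lines.
    blocks = re.split(r"\n\s*\n", page_text.strip())
    return [[line.strip() for line in block.splitlines() if line.strip()]
            for block in blocks]


def extract_records(page_text):
    """Apply each attribute's rule to every instance and assemble records."""
    records = []
    for tokens in tokenize(page_text):
        record = {}
        for attribute, rule in EXTRACTION_RULES.items():
            # Keep the first token that matches the rule; None if absent.
            record[attribute] = next(
                (m.group(1).strip() for m in map(rule.match, tokens) if m),
                None,
            )
        records.append(record)
    return records


SAMPLE_PAGE = """
Title: Data Analyst
Salary: 1500-2000
Experience: 2 years

Title: Senior Developer
Experience: 5 years
"""

if __name__ == "__main__":
    for record in extract_records(SAMPLE_PAGE):
        print(record)
```

Attributes missing from a posting are stored as `None`, so every record has the same fields and the records can be consolidated into a single table before mining.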
Our approach differs mainly in the following ways:
– First, processing job offers to reach well-defined categorical data such as salary and required experience.
– Second, adopting k-modes to cluster similar job offers, since k-modes is designed for clustering categorical data.
– Third, automatically assigning clustered job offers to job searchers based on the same fields defined for the job postings.
The combination of different techniques such as web data extraction, k-modes and Naïve Bayes would be beneficial in evaluating the effectiveness of these approaches in our application domain. This research proposes developing an intelligent tool called “JobsMining” that combines the two techniques of web data extraction and data mining. In this section, we define the techniques used in each of the two main phases of the methodology:
1. Extracting job postings from several sources in order to end up with the processed and consolidated dataset 1.
2. Clustering the categorical data in dataset 1 into a relatively small set of groups and classifying another dataset 2 (job searchers) so as to predict how new instances will behave.
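To make the clustering phase concrete, the following is a minimal sketch of k-modes on categorical records. It assumes the simple matching dissimilarity (the number of attributes on which two records differ) and random initial modes. The attribute values in `JOBS` are invented for illustration, and the code is not the JobsMining implementation.

```python
import random
from collections import Counter


def dissimilarity(a, b):
    """Simple matching distance: number of attributes whose values differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value of each attribute over a group of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, k, max_iter=20, seed=0):
    """Cluster categorical records (tuples); returns (labels, modes)."""
    rng = random.Random(seed)
    # Initial modes: k distinct records picked at random (one common choice).
    modes = rng.sample(sorted(set(records)), k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each record to its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes are too
        labels = new_labels
        # Update step: recompute each mode from its members.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old mode if a cluster went empty
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical postings: (salary band, experience level, required degree).
JOBS = [
    ("low", "junior", "bachelor"),
    ("low", "junior", "none"),
    ("low", "junior", "bachelor"),
    ("high", "senior", "master"),
    ("high", "senior", "phd"),
    ("high", "senior", "master"),
]

labels, modes = k_modes(JOBS, k=2)
print(labels)
print(modes)
```

On this toy data the junior and senior postings end up in separate clusters. Which cluster receives index 0 depends on the random initialization, so the labels are only meaningful up to renaming.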
In this paper, we first used an embedded tool for web data extraction to collect job postings. Second, we identified the characteristics of the most useful data mining techniques and developed two algorithms, k-modes clustering and Naïve Bayesian classification, that can be used to predict useful fields. We ended up with two meaningful clusters of job postings with a fair distribution (62% for C1 and 38% for C2), in which cluster 2 corresponds to all the job postings requiring high qualifications. The experimental results show that the accuracy of the clustering algorithm is 92.53%. As for the classification, we found that the classification error decreases as the training set grows. With a 75% training set and 25% validation set, the classification error was 4.9%, which is satisfactory. Although no results from previous works adopting the same methodology are available for comparison, we consider these results good. As this case study contributes to promoting employability, concerned parties could take advantage of such tools to help people find jobs in collaboration with recruitment agencies. In addition to providing a recommendation tool for job searchers, our approach could be developed in the future to help gain insights into the required skills and the distribution of jobs across the sectors and countries in the region. Furthermore, in the absence of a valid occupations list at the national level, such methodologies could be adopted to conduct studies on labor market needs and take proper actions to prepare future employees to fulfill these needs.
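The way a classification error is obtained from a 75%/25% split can be sketched as follows, using a categorical Naïve Bayes classifier with Laplace smoothing. This is an illustration under stated assumptions, not the implementation evaluated above: the smoothing choice, the function names and the job searcher attributes are hypothetical, and the toy data is trivially separable, so it does not reproduce the reported 4.9% figure.

```python
import math
import random
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)] -> Counter of attribute values
    value_counts = defaultdict(Counter)
    domains = defaultdict(set)  # distinct values seen for each attribute
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict(model, row):
    """Return the class with the highest log-posterior for one record."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Laplace (add-one) smoothing avoids zero probabilities.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(domains[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def holdout_error(rows, labels, train_fraction=0.75, seed=0):
    """Train on a random share of the data, report the error on the rest."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train, valid = indices[:cut], indices[cut:]
    model = train_naive_bayes([rows[i] for i in train],
                              [labels[i] for i in train])
    wrong = sum(predict(model, rows[i]) != labels[i] for i in valid)
    return wrong / len(valid)


# Hypothetical job searchers: (experience level, degree), labeled with the
# job cluster (C1 or C2) they should be matched to.
ROWS = [("junior", "bachelor")] * 10 + [("senior", "master")] * 10
LABELS = ["C1"] * 10 + ["C2"] * 10

if __name__ == "__main__":
    print(f"classification error: {holdout_error(ROWS, LABELS):.3f}")
```

Calling `holdout_error` with several values of `train_fraction` on a real dataset is one way to observe how the error changes with the size of the training set.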
Ng, M. K., Li, M. J., Huang, J. Z., & He, Z. (2007). On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3).
Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. DMKD, 3(8), 34-39.
Chang, C. H., Kayed, M., Girgis, M. R., & Shaalan, K. F. (2006). A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1411-1428.
Mohaghegh, S. D. (2003). Essential Components of an Integrated Data Mining Tool for the Oil & Gas Industry, With an Example Application in the DJ Basin. Paper SPE 84441 presented at the SPE Annual Technical Conference and Exhibition.
Tan, P. N. (2006). Introduction to data mining. Pearson Education India.
Nahm, U. Y., & Mooney, R. J. (2000, July). A mutually beneficial integration of data mining and information extraction. In AAAI/IAAI (pp. 627-632).
Amato, F., Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M., Moscato, V., … & Picariello, A. (2015, February). Challenge: Processing web texts for classifying job offers. In Semantic Computing (ICSC), 2015 IEEE International Conference on (pp. 460-463). IEEE.
Poch, M., Bel, N., Espeja, S., & Navio, F. (2014). Ranking Job Offers for Candidates: learning hidden knowledge from Big Data. In LREC (pp. 2076-2082).