Information retrieval based on domain term extraction and query classification algorithms

Abstract—An Information Retrieval (IR) system finds relevant documents in a large dataset based on the user's query. Queries submitted by users to search engines may be ambiguous and concise, and their meaning may change over time. As a result, understanding the nature of the information need behind queries has become an important research problem. Therefore, various search engines emphasize query classification. For an efficient IR system, this work proposes a query classification algorithm (QCA) and a domain term extraction algorithm. The system classifies queries into predefined target categories. In query classification, domain terms are extracted from the query and each of them is classified into the relevant categories stored in the database. Using the categories assigned by QCA, the system finds the relevant documents in the document library. The vector space IR model is used to retrieve the relevant documents.

I. INTRODUCTION

An Information Retrieval (IR) system finds relevant documents in a large dataset based on the user's query. IR is composed of basic components such as indexing, searching, and document classification. Current IR systems, including search engines, have a standard interface consisting of a single input box that accepts keywords. User-submitted keywords are compared against the collection index to find documents that contain those keywords. When a user's query contains multiple topic-specific keywords that accurately describe the information need, the system is likely to return valid matches; however, when the query is short and the natural language is inherently ambiguous, this simple retrieval model is prone to errors and omissions. Understanding the meaning of search queries is a key task that lies at the heart of search research.
Query classification is a difficult task because queries usually consist of only a few terms, which often leads to significant ambiguity. Semantics is very important in understanding queries when building a successful search engine. A user may be unable to formulate a precise query even when he or she knows what information is wanted. As a result, understanding the nature of the information need behind queries has become an important research problem. Therefore, this work proposes a domain term extraction algorithm and a query classification algorithm (QCA). In the proposed system, the conceptual term strategy is used to identify the relevant category for an ambiguous domain term. The system stores conceptual terms in a NoSQL graph database. Based on the conceptual term strategy and the NoSQL graph database, the system uses QCA to classify query characteristics and ambiguous domain terms. Using the classified user query, the system carries out the information retrieval process. In the query classification-based IR system, QCA and the vector space model are used to retrieve information relevant to the user's query. According to the results of the conceptual term analysis, the system retrieves documents that are more relevant to the user's needs.

The rest of the paper is organized as follows: related work is described in section 2. The underlying theory is presented in section 3. The proposed system design is presented in section 4. The proposed methodology is described in section 5, and the experimental results of the system are presented in section 6. Finally, the conclusion is given in section 7.

II. RELATED WORKS

In 2006, W. Yue, Z. Chen, and X. Lu proposed a new information retrieval algorithm based on query expansion and classification. The algorithm is motivated by the observation that very short queries, when used with traditional information retrieval methods, often have low precision, although they can achieve high recall.
Their approach attempted to capture more relevant documents through query expansion and text classification. Experimental results showed that the proposed algorithm is more accurate and efficient than traditional query expansion methods.

In 2012, S. M. Fathalla and Y. F. Hassan presented a hybrid method for reformulating and classifying user queries based on a fuzzy semantics-based approach and the K-Nearest Neighbor (KNN) classifier. The overall processes of the system are query preprocessing, fuzzy membership computation, classification, and query reformulation. Classification is performed with the KNN classifier using not only keyword-based semantics but also sentence-level semantics. After classification, the user's query is reformulated and sent to a search engine, which provides better results than sending the original query. The experiments show a significant improvement in search results compared to results from traditional keyword-based search engines.

In 2015, C. Xia and X. Wang adopted a new web query classification method. Their method consists of three steps. In the first step, some context information is labeled to enrich the training set. In the second step, the list of labeled queries is divided into word sequences, and then a graph is constructed whose nodes and edges are indexed with category labels. In the third step, a line equation is trained to evaluate the possibility that a given query belongs to a given category. Their method can reduce the training time by 10% compared to a Support Vector Machine (SVM).

III. UNDERLYING THEORY

A. Domain Term Extraction

Domain term extraction is a categorization or classification activity in which terms are classified into a set of predefined domains. It has been applied to tasks such as key phrase extraction, word sense disambiguation, multilingual text categorization, and query classification.

B. Query Classification

Queries submitted by users to search engines may be ambiguous and concise, and their meaning may change over time. Nowadays, query classification is emphasized by several search engines because of the increasing size of the web, as millions of resources are added every day. Query classification assigns a search query to one or more predefined categories, based on its topics. It consists of classifying a user query Qi into a list of n categories ci1, ci2, ..., cin. The importance of query classification is highlighted by the many services provided by search engines. A direct application is to provide better search result documents for users with interests in different categories. Search result pages can be grouped according to the categories provided by the query classification method.

Query classification is a two-step process. The first is the learning phase, in which a classification model is built. The second is the classification phase, in which the model is used to predict the class label for given data. Given a category in an intermediate taxonomy, it is directly mapped to a target category if and only if the following condition is met: one or more terms in each node along the path in the target category appear along the corresponding path of the intermediate category.

C. Information Retrieval

An Information Retrieval (IR) system can accept a user query, understand the user's needs, search a database for relevant documents, retrieve documents for the user, and rank documents based on their relevance. There are four main IR models:

1) Boolean model: A document matches the query if the set of terms associated with the document satisfies the Boolean expression representing the query. The Boolean expression of terms uses the standard Boolean operators: and, or, and not.
The result of the query is the set of matching documents.

2) Vector space model: In the vector space model, text is represented by a vector of terms. Terms are typically words and phrases. If words are chosen as terms, then each word in the vocabulary becomes an independent dimension in a very high-dimensional vector space. Any text can therefore be represented by a vector in this high-dimensional space. If a term belongs to a text, it takes on a non-zero value in the text vector along the dimension corresponding to the term. A vector-based IR method represents both documents and queries as high-dimensional vectors and calculates their similarity based on the vector inner product.

3) Language model: Statistical language models are based on probability and have foundations in statistical theory. This approach first estimates a language model for each document and then ranks the documents by the probability of the query under each document's language model.

4) Probabilistic model: Probabilistic IR models estimate the probability that a document is relevant to a query. This model is based on probability theory. The relevance of a given document is estimated with respect to the query.

IV. DESIGN OF THE PROPOSED SYSTEM

There are three main phases in this system. In the first phase, the system uses the domain term extraction algorithm to extract domain terms from the user's query. In the second phase, the system classifies each extracted domain term into a category using QCA and the Neo4j graph database. In the final phase, the system retrieves information relevant to the user's query using the classified queries.

V. PROPOSED METHODOLOGY

In this system, algorithms for domain term extraction and query classification are proposed. Using the classified queries, the system retrieves the relevant information according to the vector space IR model.

Vector Space IR Model

In the vector space IR model, a document is represented as a vector of term weights.
The number of dimensions in the vector space is equal to the number of terms used in the overall collection of documents. A query in the vector space model is treated as if it were simply another document, which allows the same vector representation to be used for queries and documents. This representation naturally leads to the use of the vector inner product as a measure of similarity between the query and a document.
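
The inner-product similarity described for the vector space IR model can be illustrated with a minimal Python sketch. It is not the implementation used in the proposed system: the term weights here are assumed to be raw term frequencies, tokenization is a simple whitespace split, and the names `term_vector`, `inner_product`, `rank`, and the sample `documents` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector of term weights.

    Assumption: the weight of a term is its raw frequency in the text.
    """
    return Counter(text.lower().split())


def inner_product(query_vec, doc_vec):
    """Similarity between a query and a document as the vector inner product."""
    return sum(weight * doc_vec[term]
               for term, weight in query_vec.items()
               if term in doc_vec)


def rank(query, documents):
    """Return indices of documents with non-zero similarity, best match first.

    The query is treated like another document, so it gets the same
    vector representation as the documents.
    """
    query_vec = term_vector(query)
    scored = []
    for index, doc in enumerate(documents):
        score = inner_product(query_vec, term_vector(doc))
        if score > 0:
            scored.append((score, index))
    # Sort by descending score; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]


# Hypothetical toy collection, for illustration only.
documents = [
    "graph database stores conceptual terms",
    "vector space model ranks documents",
    "query classification assigns query categories",
]

print(rank("query classification", documents))  # [2]
```

In this toy collection only the third document shares terms with the query, so it is the only one returned. A real system would typically replace the raw frequencies with a weighting scheme such as TF-IDF and normalize the vectors, which this sketch leaves out.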