PathSim
Similarity search is a primitive operation in database and Web search engines. With the advent of large-scale heterogeneous information networks that consist of multi-typed, interconnected objects, such as bibliographic networks and social media networks, it is important to study similarity search in such networks. Intuitively, two objects are similar if they are linked by many paths in the network. However, most existing similarity measures are defined for homogeneous networks and do not take into account the different semantic meanings behind paths, so they cannot be directly applied to heterogeneous networks. In this paper, we study similarity search defined among objects of the same type in heterogeneous networks. Moreover, by considering different linkage paths in a network, one can derive various similarity semantics. Therefore, we introduce the concept of meta path-based similarity, where a meta path is a path consisting of a sequence of relations defined between different object types (i.e., a structural path at the meta level). Whether a user explicitly specifies a path combination given sufficient domain knowledge, chooses the best path by experimental trials, or simply provides training examples from which to learn it, the meta path forms a common base for a network-based similarity search engine. In particular, under the meta path framework we define a novel similarity measure called PathSim that is able to find peer objects in the network (e.g., authors in a similar field and with a similar reputation), which turns out to be more meaningful in many scenarios than random-walk-based similarity measures. To support fast online processing of PathSim queries, we develop an efficient solution that partially materializes short meta paths and then concatenates them online to compute top-k results. Experiments on real data sets demonstrate the effectiveness and efficiency of our proposed paradigm.
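The abstract does not state the PathSim formula itself. As an illustration only, the sketch below uses the definition usually given for a symmetric meta path: twice the number of path instances between x and y, divided by the number of path instances from x to itself plus the number from y to itself. The author names, venue counts, and function names are hypothetical. The author-venue counts stand in for a materialized short meta path, and the dot product in `path_count` plays the role of concatenating it with its reverse at query time.

```python
# Hypothetical toy data: number of papers each author published at each venue.
# This plays the role of a materialized "author -> venue" half path.
AUTHOR_VENUE = {
    "alice": {"VLDB": 2, "SIGMOD": 1},
    "bob": {"VLDB": 50, "SIGMOD": 20},
    "carol": {"VLDB": 1, "SIGMOD": 2},
}


def path_count(x, y, half=AUTHOR_VENUE):
    """Count author-venue-author path instances between x and y.

    Concatenating the half path with its reverse is a dot product
    of the two authors' venue-count vectors.
    """
    return sum(count * half[y].get(venue, 0) for venue, count in half[x].items())


def pathsim(x, y, half=AUTHOR_VENUE):
    """PathSim score under the symmetric author-venue-author meta path."""
    denom = path_count(x, x, half) + path_count(y, y, half)
    return 2 * path_count(x, y, half) / denom if denom else 0.0


def top_k(query, k, half=AUTHOR_VENUE):
    """Return the k objects most similar to `query`, best first."""
    scores = [(other, pathsim(query, other, half)) for other in half if other != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # carol (similar publication volume) ranks above the far more prolific bob.
    print(top_k("alice", 2))
```

In this toy data, bob shares many more paths with alice than carol does, yet carol scores higher because the normalization favors objects of comparable visibility, which matches the "peer" notion the abstract describes.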
LSH forest: self-tuning indexes for similarity search
We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest.
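The self-tuning idea can be sketched as follows: instead of fixing the number of hash bits in advance, each index stores long labels and a query matches on the longest prefix that still yields enough candidates. This is a simplified sketch, not the paper's implementation. It assumes random-hyperplane hashing for cosine similarity, replaces the prefix trees with sorted lists of fixed-length bit labels, and uses hypothetical names and parameters (`LSHForestSketch`, `num_trees`, `max_bits`, `min_candidates`).

```python
import bisect
import math
import random


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class LSHForestSketch:
    """Simplified LSH Forest: one sorted list of bit labels per 'tree'."""

    def __init__(self, dim, num_trees=8, max_bits=16, seed=0):
        rng = random.Random(seed)
        self.max_bits = max_bits
        # Each tree has its own random hyperplanes; each hyperplane yields one bit.
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(max_bits)]
            for _ in range(num_trees)
        ]
        self.trees = [[] for _ in range(num_trees)]  # sorted (label, item_id) pairs
        self.vectors = {}

    def _label(self, tree_idx, vec):
        bits = []
        for plane in self.planes[tree_idx]:
            dot = sum(p * v for p, v in zip(plane, vec))
            bits.append("1" if dot >= 0 else "0")
        return "".join(bits)

    def add(self, item_id, vec):
        self.vectors[item_id] = vec
        for t, tree in enumerate(self.trees):
            bisect.insort(tree, (self._label(t, vec), item_id))

    def _prefix_matches(self, tree, prefix):
        # Labels sharing a prefix are contiguous in the sorted list.
        start = bisect.bisect_left(tree, (prefix,))
        matches = []
        for label, item_id in tree[start:]:
            if not label.startswith(prefix):
                break
            matches.append(item_id)
        return matches

    def query(self, vec, k=3, min_candidates=None):
        if min_candidates is None:
            min_candidates = 2 * k  # hypothetical default
        labels = [self._label(t, vec) for t in range(len(self.trees))]
        candidates = set()
        # Shorten the prefix in all trees together until enough candidates exist,
        # so no prefix length has to be tuned by hand.
        for length in range(self.max_bits, -1, -1):
            for tree, label in zip(self.trees, labels):
                candidates.update(self._prefix_matches(tree, label[:length]))
            if len(candidates) >= min_candidates:
                break
        ranked = sorted(candidates, key=lambda i: cosine(vec, self.vectors[i]), reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    rng = random.Random(1)
    forest = LSHForestSketch(dim=8)
    for i in range(50):
        forest.add(i, [rng.gauss(0, 1) for _ in range(8)])
    print(forest.query(forest.vectors[7], k=3))
```

Because candidates are collected by progressively relaxing the prefix, dense regions are resolved with long prefixes and sparse regions with short ones, which is the intuition behind the improved behavior on skewed data that the abstract claims.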