Questions tagged [data-mining]
Using the techniques of artificial intelligence and machine learning to extract patterns from large data sets and transform those data into a useful, organized form for future processing.
142 questions
0
votes
0
answers
64
views
Measuring logicality of programming languages?
I have a simple question: how would you measure the logicality of a programming language?
EDIT: I was asked to specify the term "logicality". Hence I will try and provide a stipulation. By ...
1
vote
1
answer
173
views
What books are there to learn to implement these graph algorithms?
I saw a post on Reddit (https://www.reddit.com/r/math/comments/ci50d3/visualizing_mathematical_subjects/) that utilizes label propagation, Fruchterman-Reingold algorithm, and edge betweenness ...
1
vote
1
answer
75
views
Machine learning and test split for time series data
I have used different machine learning algorithms to predict solar panels' power output. There are ten independent features for weather data.
In all models, I set time as an index and have used the ...
1
vote
3
answers
2k
views
How can we express the cosine similarity of two documents as a percentage?
We were doing project work for plagiarism checking. For this purpose, we have taken a term frequency vector of two documents and measured the similarity using a cosine similarity measure. The value of ...
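As a rough illustration of the setup described in this question, here is a minimal sketch that assumes raw term-frequency vectors built by lowercased whitespace tokenization. The helper names `tf_vector` and `cosine_similarity` and the two sample sentences are hypothetical, not taken from the post. Because term frequencies are non-negative, the cosine lies in [0, 1], so multiplying by 100 is one simple way to read it as a percentage.

```python
import math
from collections import Counter

def tf_vector(text):
    """Build a sparse term-frequency vector (assumed: whitespace tokens)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an empty document
    return dot / (norm_a * norm_b)

# Hypothetical sample documents, for illustration only.
doc1 = tf_vector("the cat sat on the mat")
doc2 = tf_vector("the cat lay on the rug")
similarity = cosine_similarity(doc1, doc2)
print(f"{similarity * 100:.1f}%")
```

Whether a scaled cosine is a meaningful plagiarism score is a separate question; the sketch only shows the arithmetic.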
1
vote
1
answer
58
views
Would samples be considered data redundancy if they are similar to each other fairly naturally?
I am working on building an ML/DL solution for a problem where the data is naturally similar, and I am worried that this would be considered data redundancy.
My question is: is that so? And ...
0
votes
0
answers
63
views
What are the confusion matrix values?
I'm currently going through past paper questions and was wondering if I could get some help answering this one?
'Consider a classification model which is applied to a set of records, of which 100 ...
0
votes
1
answer
184
views
How to detect outliers using DBSCAN?
I am working on a Fraudulent Cash Transaction Detection System using DBSCAN and I want to know the proper way to identify outliers.
Thank you
##Edit##
I had a problem with how to represent the ...
0
votes
0
answers
66
views
How to handle distribution of values with same attributes into different classes
I'm a student studying a data mining course and have come across a problem.
I need to explain the problem with the help of an example scenario as I do not know how to explain the problem in any other ...
3
votes
1
answer
236
views
How does an inverted index reduce storage requirements?
On p. 7 of the book "Introduction to Information Retrieval" (by Manning et al.), the authors explain how, given a collection of text documents, an inverted index is built by tokenizing, then ...
1
vote
0
answers
65
views
Can anyone think of applications of a 3-way (k-way) dot product in computer science or data mining?
I have developed a locality sensitive hashing algorithm for the 3-way or k-way dot product. When I say 3-way dot product I mean the following. Suppose we have
$x,y,z \in [-1,1]^{S}$ for $S \in \...
2
votes
2
answers
305
views
Why is it not always possible to compute the centroid of feature vectors?
Hi, in the data mining and machine learning course that I'm taking there is a subject on feature spaces, and there is this part about feature vector aggregation and metric spaces that I don't really ...
-1
votes
1
answer
96
views
Dimension Reduction - Which feature should we remove to reduce the dimension of the matrix?
Let's suppose that we have the following 2 tables:
If we want to reduce the dimension by one (in every table), which feature should we remove, and why?
I am confused about the way that I should work ...
3
votes
1
answer
63
views
Find plane within margin of error of >50% of points
There are $N < 3\times10^4$ 3D points. At least 50% of them lie approximately in the same plane, i.e. the distance between the plane and each point is at most $p$. Find such a plane.
Attempt: since ...
2
votes
0
answers
54
views
In topological data analysis, do bar codes that begin and end at the same index mean anything?
The typical workflow in topological data analysis is from point cloud data to filtration to a list of bar codes corresponding to each dimension.
A filtration is a sequence of simplicial complexes, ...
0
votes
0
answers
51
views
What are the key benefits of ontologies in a Systematic Literature Review?
I am working on a Systematic Literature Review (SLR) and am about to finish the data synthesis. After the SLR, I want to create an ontology and include different details of the SLR in the ontology. I have almost ...
1
vote
0
answers
174
views
Combining Computer Science and Humanities
I currently hold a bachelor's in Computer Science and a master's in Art History. I really want to combine the two, and I know of Digital Humanities, but I'm not completely aware of where Digital Humanists ...
0
votes
0
answers
127
views
Navier-Stokes equations and machine learning
I am looking for a reference explaining how to solve the Navier-Stokes equations numerically using machine learning algorithms.
Thank you in advance for your help.
4
votes
1
answer
339
views
Finding (and possibly extracting) source code in a heterogeneous text data set
I'm looking for a way to recognize and possibly extract source code from text files that may contain only source code, source code mixed with plain text or just plain text without any source code.
...
1
vote
0
answers
627
views
Algorithms for tabulating/counting/frequency counting?
It is common in data science to receive two equal-length vectors (arrays of dimension 1), say Categories and Weights.
We aim to find all unique values of Categories and sum up the corresponding ...
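The excerpt is cut off, so the following is only a minimal sketch under the assumption that the goal is to sum the entries of Weights for each unique value of Categories. The function name `weighted_counts` and the sample data are hypothetical. It makes a single pass with a hash map, which takes expected linear time in the vector length.

```python
from collections import defaultdict

def weighted_counts(categories, weights):
    """Sum the weights for each unique category in one pass (assumed goal)."""
    if len(categories) != len(weights):
        raise ValueError("categories and weights must have equal length")
    totals = defaultdict(float)
    for category, weight in zip(categories, weights):
        totals[category] += weight
    return dict(totals)

# Hypothetical sample data, for illustration only.
categories = ["a", "b", "a", "c", "b", "a"]
weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
totals = weighted_counts(categories, weights)
print(totals)
```

A sort-then-scan variant would give the same totals without hashing, at the cost of an O(n log n) sort; which is preferable depends on the data and is not settled by the excerpt.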
1
vote
0
answers
44
views
How may I look for 'regions' of text in a larger corpus of different texts?
I have an extremely large (100GB+) corpus of many different texts. All of them are in English and 'well' formatted. They are not loaded into any kind of database, think of them as a huge collection of ...
1
vote
0
answers
332
views
What is the best stream data clustering algorithm that can handle non-static, uncertain data? [closed]
I have gone through many algorithms, including streaming k-means, CluStream, etc., and they all have their pros and cons. What is the best performing algorithm in terms of
Computational Complexity
Memory ...
0
votes
1
answer
959
views
List counts of occurrences of pairs, triplets, etc. from sets
A receipt is an array of products. I have an array of receipts.
I need to generate a report in which I can find the products often bought together.
For instance, for a single receipt where the ...
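One straightforward way to produce such a report is to enumerate every k-item combination within each receipt and tally them. The sketch below assumes receipts are lists of product names and that duplicates within a receipt count once; the function name `count_itemsets` and the sample receipts are hypothetical. It is brute force: a receipt with m distinct products contributes C(m, k) combinations, so it suits short receipts and small k.

```python
from collections import Counter
from itertools import combinations

def count_itemsets(receipts, size):
    """Count how many receipts contain each combination of `size` products."""
    counts = Counter()
    for receipt in receipts:
        # Deduplicate and sort so each itemset has one canonical tuple form.
        items = sorted(set(receipt))
        counts.update(combinations(items, size))
    return counts

# Hypothetical sample receipts, for illustration only.
receipts = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
]
pair_counts = count_itemsets(receipts, 2)
triplet_counts = count_itemsets(receipts, 3)
for itemset, count in pair_counts.most_common():
    print(itemset, count)
```

For large data sets, frequent-itemset algorithms such as Apriori prune combinations that cannot be frequent rather than enumerating all of them.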
2
votes
2
answers
256
views
How to use Neural Network classification if the data are not the same size?
I have data like this.
[0 1 0 1 0]
[0 1 0 1 0 1 1]
[0 1 0 1 ]
[0 1 0 1 0 1 1 1 1 0]
...
I want to classify with a neural network, but my data have different sizes. I can ...
12
votes
5
answers
20k
views
Data Science vs Operations Research
The general question, as the title suggests, is:
What is the difference between DS and OR/optimization?
On a conceptual level I understand that DS tries to extract knowledge from the available data ...
1
vote
1
answer
244
views
Method for finding correlation between data sets
Let's say that I have $N$ data sets where I have data points at some fixed frequency, such as "daily".
What would be a good method for finding correlation between any of the data sets, or choosing a ...