DATA 525 Data Engineering and Mining


Syllabus: Fall 2023   Credit hours: 3
Class times: 01:25pm – 02:15pm, MoWeFr Classroom: Harrington Hall 218
Class # (on-campus: 525-01): 8671 Class # (on-line: 525-02): 8672

Instructor: Wen-Chen Hu   (my teaching philosophy) Office: Upson II 366K
Zoom: https://und.zoom.us/j/2489867333 Email: wenchen@cs.und.edu
Office hours: 02:30pm – 04:30pm, MoWeFr

Prerequisites:
  • DATA 511 Computing for Data Science I,
  • DATA 512 Computing for Data Science II, and
  • DATA 513 Mathematical Foundations for Data Science, or
  • Permission of the School of Electrical Engineering and Computer Science
Synchronous class delivery: The class lectures will be delivered synchronously via https://und.zoom.us/j/2489867333, and the Zoom video will be posted on Blackboard afterwards.

Lecture notes: No textbook will be used. Instead, award-winning, interactive, informative, and practical lecture notes (based on books, papers, online documents, and user manuals) and detailed, precise class instructions will be provided. Collectively, the lecture notes and instructions are more like a small book, which supplies much more information than regular notes do and makes studying the subject much easier. Students should have no problem learning the subjects or taking the exams after studying them and doing the programming exercises.



Grading:


Announcements:



Tentative Schedule:


Each entry gives the week, the class dates, and the topic; items due are noted in parentheses.

Week 0: 0. Computer Career and Data Research & Technologies
  0.1 A computer career
  0.2 Data research
  0.3 Data technologies

Week 1 (08/23, 08/25): 1. Introduction to DATA 525
  1.1 Course introduction
  1.2 Data life cycle
  1.3 Topics covered

Week 2 (08/28, 08/30, 09/01): 2. Programming Exercise I
  2.1 Specifications
  2.2 Web page download
  2.3 Code sample
  08/30: Last day to add a course or drop without record; last day to add audit or change to/from audit; last day to receive a refund on a dropped class. Drops after the last day to add will appear on a transcript.

Week 3 (09/06, 09/08): 3. Essential Technologies for Exercise Construction
  3.1 Essential software and tools
  3.2 Writing HTML scripts
  3.3 Using Unix/Linux
  09/04: Holiday, Labor Day (Monday); no classes

Week 4 (09/11, 09/13, 09/15): 4. PHP (HyperText Preprocessor)
  4.1 LAMP
  4.2 PHP
  4.3 MySQL

Week 5 (09/18, 09/20, 09/22): 5. Web Search Services
  5.1 The World Wide Web
  5.2 Web page information
  5.3 Web search methods

Week 6 (09/25, 09/27, 09/29): 6. Information Retrieval (IR)
  6.1 Various IR methods
  6.2 Automatic indexing methods
  6.3 Data classification and clustering

Week 7 (10/02, 10/06): 7. The PageRank Algorithm
  7.1 Background (due: EX I)
  7.2 The PageRank algorithm
  7.3 Computing PageRank scores
  10/04: Exam I (for both on-campus and on-line students; 6:30pm – 8:00pm, Wednesday)

Week 8 (10/09, 10/11, 10/13): 8. Decision Trees
  8.1 Background
  8.2 Measuring impurity
  8.3 Information gain

Week 9 (10/16, 10/18, 10/20): 9. K-Nearest Neighbor (kNN) Algorithm
  9.1 Background
  9.2 kNN for prediction and smoothing
  9.3 Strengths and weaknesses

Week 10 (10/23, 10/25, 10/27): 10. Artificial Neural Networks (ANNs)
  10.1 Artificial intelligence
  10.2 Backpropagation
  10.3 Genann: a minimal ANN

Week 11 (10/30, 11/01, 11/03): 11. Firebase Database
  11.1 Programming Exercise III
  11.2 Introduction to Firebase
  11.3 Using Firebase

Week 12 (11/06, 11/08): 12. TensorFlow
  12.1 TFJS operations
  12.2 TFJS models
  12.3 TFJS visor
  11/10: Holiday, Veterans Day (Friday); no classes
  11/09: Last day to change to or from S/U grading; last day to change to or from audit grading; last day to drop a full-term course or withdraw from school

Week 13 (11/17): 13. A TensorFlow.js Example
  13.1 Example introduction
  13.2 Example model
  13.3 Example training
  11/13: Student’s defense (Monday); no classes
  11/15: Exam II (for both on-campus and on-line students; 6:30pm – 8:30pm, Wednesday)

Week 14 (11/20): 14. JavaScript
  14.1 JavaScript syntax
  14.2 JavaScript instructions
  14.3 JavaScript examples
  11/22, 11/23, 11/24: Thanksgiving Break (WeThFr); no classes

Week 15 (11/27, 11/29, 12/01): 15. Data Mining Concepts
  15.1 Introduction to data mining
  15.2 Data mining steps
  15.3 Data mining techniques

Week 16 (12/04, 12/06): 16. Data Processing and Management
  16.1 Data science
  16.2 Data warehouse
  16.3 Data fusion (due: EX III)

Week 17 (12/13): Final exam (for both on-campus and on-line students; 06:30pm – 08:30pm, Wednesday)

Week 18 (12/19): Grades posted before noon, Tuesday


According to IT Career Finder, the Best Computer Jobs for the Future (05/18/2023) are as follows:

  1. IT security specialist (not developer)
  2.  Mobile application developer 
  3. Software engineer
  4. Video game designer (including developer)
  5. Computer systems analyst (not developer)
  6.  Web developer 
  7. Health information technician
  8. Technology manager
  9.  Database administrator (including developer) 
  10. Network administrator (not developer)


Computer science is different from many other disciplines (like electrical engineering). It is more like a professional school (such as a culinary school), emphasizing practical work over subject studies, because many IT companies want new recruits to start contributing immediately. There are three kinds of computing personnel:
  • Developers:

    • Positions (plenty): Developers of front-end and back-end web pages, mobile apps, and all kinds of software
    • Skills (more stable): Programming languages (such as C++ and Java), web programming, mobile app development, data processing and management including databases, and data structures & algorithms

  • Practitioners:

    • Positions (not many): Experienced personnel like data scientists, database or system administrators, security analysts, and network architects (more applications & configuration and less development)
    • Skills (based on the needs of companies): Databases, data warehousing, data lakes, Hadoop, MapReduce, Linux, SPSS, SAS, Cognos, MATLAB, Tableau, etc.

  • Researchers:

    • Industrial positions (few and based on the needs of corporations): High quality personnel required for the advanced areas like artificial intelligence, security, computer vision, autonomous driving, and speech recognition
    • Academic positions/trends (few and changed according to the government policies): ❓ ⇐ artificial intelligence ⇐ big data ⇐ high-performance computing ⇐ security ⇐ (mobile) networks
Unless you have an impressive resume or a strong connection, practicing tens or hundreds of questions posted on LeetCode is a must in order to secure a job at corporations (like Google and Facebook). Otherwise, your chance of answering the questions correctly is low because of their high difficulty and time constraints. In addition, you need to create a LinkedIn page to show your achievements, and may consider uploading your projects to GitHub to showcase them.



Remark I: Terminology and definitions will be discussed minimally in this course. Instead, (i) effective methods and practical work will be emphasized and enforced and (ii) the trends in data engineering and mining will be discussed.

Remark II: Unlike disciplines such as databases or the World Wide Web, data engineering and mining (DEM) is one of the disciplines (like image processing or artificial intelligence) without a coherent set of methods or algorithms. DEM uses many methods (such as artificial neural networks or relevance feedback), and each method is usually not closely related to the others (like decision trees or sequential pattern mining).

Remark III: In order to show what data engineering and mining (DEM) is within one semester, this course has to pick a small number of fundamental topics, instead of many topics, to investigate. Students can then use this training to choose appropriate methods for the problems they encounter in the future.

Remark IV: Data engineering and mining (and information retrieval) is a mature subject. A wide variety of methods have been applied to it, and the current methods are rather complicated because of its maturity. In order to cover more topics, the methods introduced in this course are fundamental or primitive. Students learn how the DEM methods work, and may try to enhance the methods or apply them in their programming exercises.

Remark V: DEM is a well-developed subject, and it is not easy to find a brand-new method. On the other hand, artificial intelligence (AI), data mining (DM), machine learning (ML), and information retrieval (IR) have plenty of methods available to be used or adapted. In order to take advantage of both, DEM borrows many methods from AI/DM/ML/IR. However, DEM is not the same as AI/DM/ML/IR because of the problem of data processing. That is, a data research topic may consist of two parts, DEM and AI/DM/ML/IR, and you want to put the emphasis on the former instead of the latter because DEM is more useful and practical.

Remark VI: Take the following steps to conduct research:

  1. Identify a problem.
  2. Study related literature and methods.
  3. Create/adapt a method to solve/suit the problem.
  4. Figure out how to improve the method.
  5. Complete the implementation.
  6. Perform the testing to ensure the system is correct.
  7. Evaluate the system including comparisons.
  8. Publish the results.

Remark VII: Online asynchronous delivery is also provided for distance students. It is conducted fully through Internet instruction. For details, check UND Online & Distance Education or DEDP (Distance Engineering Degree Program). In addition, https://und.zoom.us/j/2489867333 or YuJa is used for hosting and sharing lecture videos, and ProctorU may be used to monitor the exams.

Instructor’s qualification: The instructor’s current research interests include (mobile) data research and applications such as (mobile) data security & mining, and mobile/smartphone/spatial/web computing. He has applied various information retrieval methods (such as artificial neural networks, finite-state machines, and association-rule and sequential-pattern mining) to mobile applications and web searches. The instructor has published more than 100 research publications and advised more than 50 graduate students. Most of the research topics are related to (mobile) data engineering, management, and mining.


University of North Dakota Course Description (DATA 525) —
This course studies theoretical and applied issues related to data engineering and mining. Data engineering is to identify, investigate, and analyze the underlying principles in the design and effective use of information systems; and data mining is to discover patterns in large data sets and transform the patterns into a comprehensible structure for further applications. The following topics are covered: data collection, data preparation, data indexing and storage, data processing and analysis, data classification and clustering, knowledge discovery, information retrieval, data visualization, data sharing, data applications, and some other special topics.

Data Science from Wikipedia
Data science is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization.

Data Engineering from IEEE Computer Society Data Engineering Bulletin
The role of data in the design, development, management and utilization of information systems:

  • Databases and the World Wide Web,
  • Management of semistructured data, metadata and XML,
  • Heterogeneous, distributed, parallel and mobile databases,
  • Data warehousing and OLAP,
  • Data, text and web mining,
  • Optimization of query processing and database architectures,
  • Indexing, access methods and data structures,
  • Temporal, spatial, scientific, statistical, biological databases, and
  • Security and integrity control.

Data Engineering from Data & Knowledge Engineering
Data engineering is to identify, investigate and analyze the underlying principles in the design and effective use of database systems:

  • Representation and manipulation of data,
  • Architectures of database systems,
  • Construction of databases,
  • Applications, case studies, and management issues, and
  • Tools for specifying and developing databases using tools based on linguistics or human machine interface principles.

Data Mining from Wikipedia
Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems:

  • Classification,
  • Clustering,
  • Dependency analysis,
  • Descriptive function,
  • Mining of frequent patterns,
  • Optimization,
  • Prediction and estimation,
  • Probabilistic modelling, and
  • Search.
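
As a small illustration of the classification task listed above, the following is a minimal sketch of the k-nearest neighbor (kNN) idea, which appears later in the schedule. It is not taken from the course materials: the function name `knn_classify`, the toy data, and the choice of Euclidean distance with k = 3 are all assumptions made for this example.

```python
from collections import Counter
import math


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    This helper is hypothetical and only illustrates the idea.
    """
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (height_cm, weight_kg) -> size label.
samples = [
    ((158, 58), "M"), ((160, 59), "M"), ((163, 61), "M"),
    ((168, 66), "L"), ((170, 68), "L"), ((173, 70), "L"),
]

print(knn_classify(samples, (161, 60)))  # three nearest points are all "M"
print(knn_classify(samples, (171, 69)))  # three nearest points are all "L"
```

Sorting the whole training set keeps the sketch short; for large data sets an index structure or a partial sort would normally be preferred.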

A Typical Workflow of Data Life Cycle —


A Typical System Structure of Web Search Engines —



An Internet-Enabled and Mobile Database Course Sequence —
This is part of an Internet/mobile-enabled database course sequence offered by me:
CSCI 260 .NET and World Wide Web Programming

CSCI 457 Electronic and Mobile Commerce Systems

DATA 520 Databases

CSCI 513 Advanced Database Systems

CSCI 515 Data Engineering and Management

DATA 525 Data Engineering and Mining
The following platforms, software, and tools used in these courses greatly help students land a decent job:
  • CSCI 260 (.NET and World Wide Web Programming) to build database-driven websites by using

    • Microsoft Access database,
    • Microsoft ASP.NET,
    • Microsoft C# or Visual Basic,
    • Microsoft .NET, and
    • Microsoft Visual Studio.

  • CSCI 457 (Electronic and Mobile Commerce Systems) to build electronic and mobile commerce systems by using

    • Android programming,
    • Android-server-database connection,
    • (L) Linux operating system,
    • (A) Apache web server,
    • (M) MySQL database, and
    • (P) PHP.

  • DATA 520 (Databases) to build Internet/mobile-enabled database systems by using

    • Android programming,
    • Android-server-database connection,
    • JDBC (Java Database Connectivity),
    • Oracle database, and
    • Relational database design and SQL.

  • CSCI 513 (Advanced Database Systems) to build Internet-enabled and embedded database systems by using

    • Android programming,
    • Android SQLite embedded database,
    • JDBC (Java Database Connectivity),
    • Object-relational SQL and PL/SQL, and
    • Oracle (an object-relational database).

  • CSCI 515 (Data Engineering and Management) to build location-based services and data-mining systems to discover knowledge from a large set of data by using

    • Android programming,
    • Android Google APIs and Firebase database,
    • Data mining and knowledge discovery,
    • Information retrieval,
    • Location-based services, and
    • Smartphones and mobile handheld devices.

  • DATA 525 (Data Engineering and Mining) to build Internet-enabled data-mining systems to discover knowledge from a large set of data by using

    • Data mining and knowledge discovery,
    • Internet-enabled Firebase database,
    • Information retrieval, and
    • Internet-enabled TensorFlow.