CSCI 515 Data Engineering and Management

(a course using both practical software development and configuration)


Syllabus: Spring 2024
Credit hours: 3
Class times: 03:30pm – 04:45pm, TuTh
Classroom: Harrington Hall 324
Class # (on-campus: 515-01): 21146
Class # (on-line: 515-02): 21147

Instructor: Wen-Chen Hu (my teaching philosophy)
Office: Upson II 366K
Zoom: https://und.zoom.us/j/2489867333
Email: wenchen@cs.und.edu
Office hours: 12:30pm – 02:30pm, TuTh

Prerequisite: CSCI 513 Advanced Database Systems or consent of the instructor

Synchronous class delivery: The class lectures will be delivered synchronously via https://und.zoom.us/j/2489867333, and the Zoom recordings will be posted on Blackboard afterwards. Students can watch the recordings at any time.

Lecture notes: No textbook will be used. Instead, award-winning, interactive, informative, and practical lecture notes (based on books, papers, online documents, and user manuals) and detailed, precise class instructions will be provided. Collectively, the lecture notes and instructions are more like a small book, which supplies much more information than regular notes do and makes the subjects much easier to study. Students should have no problem learning the subjects or taking the exams after studying the notes and doing the programming exercises.



Grading:


Announcements:



Tentative Schedule:


Week  Class          Topic                                                 Due    Where
0                    0. Computer Career and Data Research & Technologies
                       0.1 A computer career
                       0.2 Data research
                       0.3 Data technologies
1     01/09          1. Introduction to CSCI 515
      01/11            1.1 Course introduction
                       1.2 Data life cycle
                       1.3 Topics covered
2     01/16          2. Programming Exercise I
      01/18            2.1 Specifications
                       2.2 Web page download
                       2.3 Code sample
      01/18          Last day to add a course or drop without record
                     Last day to add audit or change to/from audit
                     Last day to receive a refund on a dropped class
                     Drops after the last day to add will appear on a transcript.
3     01/23          3. Essential Technologies for Exercise Construction
      01/25            3.1 Essential software and tools
                       3.2 Using Linux
                       3.3 Writing HTML scripts
4     01/30          4. PHP (HyperText Preprocessor)
      02/01            4.1 LAMP
                       4.2 PHP
                       4.3 MySQL
5     02/06          5. Web Search Services
      02/08            5.1 The World Wide Web
                       5.2 Web page information
                       5.3 Web search methods
6     02/13          6. Information Retrieval (IR)
      02/15            6.1 Various IR methods
                       6.2 Automatic indexing methods
                       6.3 Data classification and clustering              EX I
7     02/22          7. The PageRank Algorithm
                       7.1 Background
                       7.2 The PageRank algorithm
                       7.3 Computing PageRank scores
      02/20          Exam I (for both on-campus and on-line students; 06:30pm – 08:30pm, Tuesday)
8     02/27          8. Firebase Database
      02/29            8.1 Programming Exercise II
                       8.2 Introduction to Firebase
                       8.3 Using Firebase
9     03/04 – 03/08  Spring Break: no classes
10    03/12          10. TensorFlow
      03/14            10.1 TFJS operations
                       10.2 TFJS models
                       10.3 TFJS visor
11    03/19          11. A TensorFlow.js Example
      03/21            11.1 Example introduction
                       11.2 Example model
                       11.3 Example training
12    03/26          12. JavaScript
      03/28            12.1 JavaScript syntax
                       12.2 JavaScript instructions
                       12.3 JavaScript examples
13    04/02          13. Decision Trees
      04/04            13.1 Background
                       13.2 Measuring impurity
                       13.3 Information gain
      04/05          Last day to change to or from S/U grading
                     Last day to change to or from audit grading
                     Last day to drop a full-term course or withdraw from school
14    04/11          14. k-Nearest Neighbors (kNN) Algorithm
                       14.1 Background
                       14.2 kNN for prediction and smoothing
                       14.3 Strengths and weaknesses
      04/09          Exam II (for both on-campus and on-line students; 06:30pm – 08:30pm, Tuesday)
15    04/16          15. Artificial Neural Networks (ANNs)
      04/18            15.1 Artificial intelligence
                       15.2 Backpropagation
                       15.3 Genann: a minimal ANN
16    04/23          16. Data Processing and Management
      04/25            16.1 Data science
                       16.2 Data warehouse
                       16.3 Data fusion
17    04/30          17. Data Mining Concepts
      05/02            17.1 Introduction to data mining
                       17.2 Data mining steps
                       17.3 Data mining techniques                         EX II
18    05/07          Final exam (for both on-campus and on-line students; 06:30pm – 08:30pm, Tuesday)
19    05/14          Grades posted before noon, Tuesday


According to U.S. News, the best tech jobs of 2023 are as follows:
  1. Software developer (median salary: $120,730)
  2. Information security analyst (not developer; median salary: $102,600)
  3. IT manager (not developer; median salary: $159,010)
  4. Web developer (median salary: $77,030)
  5. Computer systems analyst (not developer; median salary: $99,270)
  6. Data scientist (median salary: $100,910)
  7. Database administrator (including developing; median salary: $96,710)
  8. Computer network architect (not developer; median salary: $120,520)
  9. Computer system administrator (not developer; median salary: $80,600)
  10. Computer support specialist (not developer; median salary: $49,770)
  11. Programmer (median salary: $93,000)


Computer science is different from many other disciplines (like electrical engineering). It is more like a professional school (such as a culinary school), which emphasizes practical work over subject study because many IT companies want new recruits to start contributing immediately. There are three kinds of computing personnel:
  • Developers:

    • Positions (plenty): Developers of front-end and back-end web pages, mobile apps, and all kinds of software
    • Skills (more stable): Programming languages (such as C++ and Java), web programming, mobile app development, data processing and management including databases, and data structures & algorithms

  • Practitioners:

    • Positions (not many): Experienced personnel like data scientists, database or system administrators, security analysts, and network architects (more applications & configuration and less development)
    • Skills (based on the needs of companies): Databases, data warehousing, data lakes, Hadoop, MapReduce, Linux, SPSS, SAS, Cognos, MATLAB, Tableau, etc.

  • Researchers:

    • Industrial positions (few and based on the needs of corporations): High-quality personnel required for advanced areas like artificial intelligence, security, computer vision, autonomous driving, and speech recognition
    • Academic positions/trends (few and changing with government policies): ❓ ⇐ artificial intelligence ⇐ big data ⇐ high-performance computing ⇐ security ⇐ (mobile) networks

Unless you have an impressive resume or a strong connection, practicing tens or hundreds of the questions posted on LeetCode is a must in order to secure a job at corporations like Google and Facebook. Otherwise, your chance of answering the questions correctly is low because of their high difficulty and time constraints. In addition, you need to create a LinkedIn page to show your achievements, and you may consider uploading your projects to GitHub to showcase them.



Remark I: Terminology and definitions will be discussed minimally in this course. Instead, (i) effective methods and practical work will be emphasized and enforced, (ii) the trend of (mobile) data engineering and management will be discussed, and (iii) smartphone structures will be studied.

Remark II: Unlike disciplines such as databases or the World Wide Web, data engineering and management (DEM) is one of the disciplines (like image processing or artificial intelligence) without a coherent set of methods or algorithms. DEM uses many methods (such as artificial neural networks or relevance feedback), and each method is usually not closely related to the others (like decision trees or sequential pattern mining).

Remark III: To show what data engineering and management (DEM) is within one semester, this course picks a small number of fundamental topics to investigate rather than many topics. Students can then use this training to choose appropriate methods for the problems they encounter in the future.

Remark IV: Data engineering and management (and information retrieval) is a mature subject. A wide variety of methods have been applied to it, and the current methods are rather complicated because of its maturity. In order to cover more topics, this course introduces only fundamental or primitive methods. Students learn how the DEM methods work, and may try to enhance the methods or apply them in their programming exercises.

Remark V: DEM is a well-developed subject, and it is not easy to find a brand-new method. On the other hand, artificial intelligence (AI), data mining (DM), machine learning (ML), and information retrieval (IR) have plenty of methods available to be used or adopted. To take advantage of both, DEM borrows many methods from AI/DM/ML/IR. However, DEM is not the same as AI/DM/ML/IR because of the problem of data processing. That is, a data research topic may consist of two parts, DEM and AI/DM/ML/IR, and you want to emphasize the former rather than the latter because DEM is more useful and practical.

Remark VI: Take the following steps to conduct research:

  1. Identify a problem.
  2. Study related literature and methods.
  3. Create/adapt a method to solve/suit the problem.
  4. Figure out how to improve the method.
  5. Complete the implementation.
  6. Perform the testing to ensure the system is correct.
  7. Evaluate the system including comparisons.
  8. Publish the results.


Remark VII: Online asynchronous delivery is also provided for distance students. It is conducted entirely over the Internet. For details, check UND Online & Distance Education or DEDP (Distance Engineering Degree Program). In addition, Zoom (https://und.zoom.us/j/2489867333) or YuJa is used for hosting and sharing lecture videos, and ProctorU may be used to monitor the exams.

Instructor’s qualification: The instructor’s current research includes mobile computing and information retrieval. He has applied various information retrieval methods (such as artificial neural networks, finite-state machines, and association-rule and sequential-pattern mining) to mobile applications and web searches. The instructor has more than 100 research publications and has advised more than 50 graduate students. Most of the research topics are related to (mobile) data engineering and management.


University of North Dakota Course Description (CSCI 515) —
This course studies theoretical and applied research issues related to data engineering, management, and science. Topics will reflect state-of-the-art and state-of-the-practice activities in the field. The course focuses on well-defined theoretical results and empirical studies that have potential impact on data acquisition, analysis, indexing, management, mining, retrieval, and storage.

Data Science from Wikipedia
Data science is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization.

Data Engineering from IEEE Computer Society Data Engineering Bulletin
The role of data in the design, development, management and utilization of information systems:

  • Databases and the World Wide Web,
  • Management of semistructured data, metadata and XML,
  • Heterogeneous, distributed, parallel and mobile databases,
  • Data warehousing and OLAP,
  • Data, text and web mining,
  • Optimization of query processing and database architectures,
  • Indexing, access methods and data structures,
  • Temporal, spatial, scientific, statistical, biological databases, and
  • Security and integrity control.

Data Engineering from Data & Knowledge Engineering
Data engineering is to identify, investigate and analyze the underlying principles in the design and effective use of database systems:

  • Representation and manipulation of data,
  • Architectures of database systems,
  • Construction of databases,
  • Applications, case studies, and management issues, and
  • Tools for specifying and developing databases using tools based on linguistics or human machine interface principles.

Data Management from Wikipedia
Data management comprises all the disciplines related to managing data as a valuable resource:

  • Data governance,
  • Data architecture, analysis and design,
  • Database management,
  • Data security management,
  • Data quality management,
  • Reference and master data management,
  • Data warehousing and business intelligence management,
  • Data, text and web mining,
  • Optimization of query processing and database architectures,
  • Indexing, access methods and data structures,
  • Temporal, spatial, scientific, statistical, biological databases, and
  • Security and integrity control.

Each student is required to build the following two systems:
  • a focused web search engine based on a data life cycle and
  • a data mining system using Firebase and TensorFlow.




An Internet-Enabled and Mobile Database Course Sequence —
This is part of an Internet/mobile-enabled database course sequence offered by me:
CSCI 260 .NET and World Wide Web Programming

CSCI 457 Electronic and Mobile Commerce Systems

DATA 520 Databases

CSCI 513 Advanced Database Systems

CSCI 515 Data Engineering and Management

The following platforms, software, and tools used in these courses can help students find jobs:
  • CSCI 260 (.NET and World Wide Web Programming) to build database-driven websites by using

    • Microsoft Access database,
    • Microsoft ASP.NET,
    • Microsoft C# or Visual Basic,
    • Microsoft .NET, and
    • Microsoft Visual Studio.

  • CSCI 457 (Electronic and Mobile Commerce Systems) to build electronic and mobile commerce systems by using

    • Android programming,
    • Android-server-database connection,
    • (L) Linux operating system,
    • (A) Apache web server,
    • (M) MySQL database, and
    • (P) PHP.

  • DATA 520 (Databases) to build Internet/mobile-enabled database systems by using

    • Android programming,
    • Android-server-database connection,
    • JDBC (Java Database Connectivity),
    • Oracle database, and
    • Relational database design and SQL.

  • CSCI 513 (Advanced Database Systems) to build Internet-enabled and embedded database systems by using

    • Android programming,
    • Android SQLite embedded database,
    • JDBC (Java Database Connectivity),
    • Object-relational SQL and PL/SQL, and
    • Oracle (an object-relational database).

  • CSCI 515 (Data Engineering and Management) to build Internet-enabled data-mining systems to discover knowledge from a large set of data by using

    • Data mining and knowledge discovery,
    • Internet-enabled Firebase database,
    • Information retrieval, and
    • Internet-enabled TensorFlow.