data science

The data science decade has focused significant attention on the discipline of statistics as undergraduate enrollments have surged nationwide.

Driving the interest in statistics is the explosion of data being produced by scientists, large Internet firms, and electronic devices of every size and shape. Scientists, and increasingly CEOs and policymakers, broadly recognize that these data could be valuable in informing policy and business decisions. To this end, the job market for people who can turn data into useful information is extremely robust, and indications are that demand for this kind of work significantly outstrips supply. Students, by all accounts, are eager to improve their data analysis skills and are prudently investing in them. But although this flood of data has heightened interest in statistics, it has also created problems.

Contemporary data streams differ significantly from the data that many statisticians, and consequently many statistics students, are familiar with. For instance, the typical data set a student meets in an introductory statistics course comprises a few dozen rows and three or four non-collinear variables in columns, gathered through a simple random sample or a randomized trial. Such data are likely to satisfy the conditions needed to support statistical inference within a multiple regression model.

Pedagogically, this pleases both instructor and students: the data conform to the model, so we can go ahead and apply the techniques we have learned to draw valid conclusions. However, the data that many of our current students will be asked to analyze, particularly if they enter government or industry, will not be as neat.

First, these data will probably not come from an experiment; they are much more likely to be observational. Second, they are unlikely to arrive in a two-dimensional row-and-column form. They may sit in a database or a structured text file, come from multiple sources with no clear connecting identifier, or, worse, have no structure at all (e.g., web-scraped data).

Background and Related Work

Although most people accept that a good foundation in mathematics is needed, it is equally evident that computing skills are needed to be an effective, practicing statistician. Drawing this parallel, Nolan and Temple Lang make a strong case for a greater role for computing in the statistics curriculum. Citing them, the American Statistical Association Undergraduate Guidelines Workgroup also stresses the importance of computing skills, going so far as to use the term data science in its 2014 guidelines for undergraduate majors in statistical science.

By computing skills, we mean statistical programming in an environment such as R. This should be recognized as a distinct and higher-level ability than conducting statistical analyses in a point-and-click, menu-driven environment such as Minitab. Nolan and Temple Lang go even further, advocating the value of learning general command-line tools, such as grep (for regular expressions) and other standard UNIX commands, which strictly speaking have nothing to do with statistics but are extremely useful for cleaning and manipulating documents of all sorts.
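To make the value of regular expressions for cleaning concrete, here is a minimal sketch. It is written in Python rather than as a grep command, and the records, field names, and helper functions are all hypothetical, invented only for illustration. It keeps the lines that contain an ISO-style date, much as `grep -E` would, and then extracts a trimmed name field from each.

```python
import re

# Hypothetical messy records, invented purely for illustration.
raw_lines = [
    "id=001; name=Alice ; joined=2013-09-04",
    "id=002; name=Bob; joined=09/15/2013",
    "# comment line that should be skipped",
    "id=003; name=Carol;joined=2014-01-20",
]

# Pattern for an ISO-style date (YYYY-MM-DD) in the "joined" field.
iso_date = re.compile(r"joined=\d{4}-\d{2}-\d{2}")


def grep(pattern, lines):
    """Return the lines containing a match, in the spirit of `grep -E`."""
    return [line for line in lines if pattern.search(line)]


def extract_name(line):
    """Pull out the name field, trimming stray whitespace around it."""
    match = re.search(r"name=\s*([^;]+?)\s*;", line)
    return match.group(1) if match else None


matched = grep(iso_date, raw_lines)
names = [extract_name(line) for line in matched]
print(names)
```

The same filtering could be done directly at the command line with grep; the point of the sketch is only that a single pattern can both select and clean irregular text.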

Working statisticians appear to be almost unanimous that most of the time on most projects is spent on data cleaning and manipulation, or data wrangling, as it is sometimes called. The impetus for including these skills in the statistics curriculum is not merely convenience, however, nor should a lack of interest or ability on the part of instructors stand in the way.

The Course

Here, we report on an experimental course named Data Science (SDS 292), currently taught as part of the Statistical & Data Sciences Program at Smith College, that was first offered in the fall of 2013 and repeated in the fall of 2014. Eighteen students took the course in the first year and another 24 the next year. An introductory statistics course and some coding background are prerequisites. Existing courses at Berkeley, Macalester, and St. Olaf are the closest pedagogical relatives of SDS 292. SDS 292 is structured as a sequence of two- to three-week modules: data visualization, data manipulation and data wrangling, computational statistics, machine (statistical) learning, and miscellaneous topics.

Data Visualization

From the start of the course, students are reminded that statistical work has limited worth if it cannot be made intelligible to non-statisticians. More precisely, most data scientists employed in government and industry, in contrast with academics, will have a manager who is typically less technically literate than they are. An ideal but intricate statistical model may not persuade non-statisticians if it cannot be communicated effectively.

Data graphics offer a means of presenting relationships in data, yet most students have never been exposed to organized thinking about how to design effective data graphics. In SDS 292, the first ten class days are spent on data visualization. This serves a dual purpose: it is an engaging hook for a science class, and it gives students with weaker programming backgrounds an opportunity to familiarize themselves with R.

Data Manipulation

As mentioned earlier, statisticians frequently complain that the majority of time spent on a statistical project is consumed by data cleaning and manipulation. In our first-year class, we try to insulate students from this reality as much as possible, showing them only carefully prepared data sets. In SDS 292, by contrast, students are challenged to become proficient in a range of common data manipulation methods.

The phrase data management sounds dull and IT-related, but there is growing recognition that such data wrangling skills are not only useful but also part of a larger intellectual practice. One of the key objectives of SDS 292 is to build students’ ability to think with data in both a theoretical and a practical manner. Over the following three weeks, students receive a rapid introduction to data manipulation using R and SQL. Following the data manipulation verbs promoted by Wickham and Francois, students learn how to carry out the most basic data operations in both SQL and R and are encouraged to consider how the two relate.
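The correspondence between SQL clauses and data manipulation verbs can be sketched as follows. This is only an illustration under stated assumptions: the course itself uses R and SQL, whereas here Python's built-in sqlite3 module runs the SQL and plain Python stands in for the verb-style code. The flights table, its columns, and its values are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical records: (carrier, month, delay in minutes).
flights = [
    ("AA", 1, 10), ("AA", 1, 30), ("AA", 2, 5),
    ("UA", 1, 20), ("UA", 2, 40), ("UA", 2, 60),
]

# SQL version: WHERE filters rows, GROUP BY groups them,
# AVG summarizes each group, and ORDER BY arranges the result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, month INTEGER, delay REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", flights)
sql_result = conn.execute(
    """
    SELECT carrier, AVG(delay) AS mean_delay
    FROM flights
    WHERE delay >= 10
    GROUP BY carrier
    ORDER BY carrier
    """
).fetchall()
conn.close()

# The same filter / group / summarize / arrange steps in plain Python.
groups = defaultdict(list)
for carrier, month, delay in flights:
    if delay >= 10:                      # filter
        groups[carrier].append(delay)    # group
py_result = [
    (carrier, sum(delays) / len(delays))  # summarize
    for carrier, delays in sorted(groups.items())  # arrange
]

print(sql_result)
print(py_result)
```

Both versions express the same sequence of operations, which is the relation students are encouraged to notice: the SQL is declarative and leaves the order of execution to the database, while the verb-style code spells out each step.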

Machine Learning

Two weeks are dedicated to foundational machine learning topics. Some educators will find that this section of the course repeats too much of what is already taught in computer science or applied statistics. Others will argue that students who are mostly interested in the communication and visualization aspects of data science will not care for these subjects. Even so, a brief overview of machine learning gives students a practical context for trying out algorithmic models.

Assignments compel students to confront the constraints of large data sets and to seek out statistical methods beyond the introductory ones. To understand machine learning, one has to see the difference between the data miner’s mindset and the statistician’s. Breiman describes two kinds of models f, data models and algorithmic models, relating y, the response variable, to x, a vector of explanatory variables. One might posit a data model f such that y ≈ f(x). If one can establish that f is a good enough approximation of the true process that gave rise to y from x, then one is in a position to start making inferences about f.
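As a rough illustration of the data-model view, the sketch below assumes the simplest possible form for f, a straight line f(x) = a + b·x, fits it by ordinary least squares, and then inspects the residuals y − f(x) to judge how well f approximates the process behind the data. The data values and the linear form are assumptions made for this example, not part of the course material.

```python
# Hypothetical observations, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Ordinary least squares for the assumed data model f(x) = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(x, y)


def f(xi):
    """The fitted data model."""
    return intercept + slope * xi


# Residuals measure how far each observation falls from the model.
residuals = [yi - f(xi) for xi, yi in zip(x, y)]

# R^2: the share of the variation in y that the model accounts for.
mean_y = sum(y) / len(y)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(f"f(x) = {intercept:.2f} + {slope:.2f} * x, R^2 = {r_squared:.4f}")
```

Only once the residuals suggest that f is an adequate description does it make sense to interpret the slope as a statement about the underlying process, which is the inferential step the data-model view depends on.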

In the data-model view, the goal is to learn about the unknown true process that generated y from x, and the assumption is that f captures what is interesting about it.

Computing

Practical programming and computational skills are essential for a data scientist, and for that reason no attempt is made to shield students from having to write their own code. Extensive examples are provided, and lengthy lecture notes with annotated computations in R are distributed for each class.

Lectures alternate between working through ideas on the blackboard and live coding on the projected computer screen, and students are expected to bring their laptops every day and participate actively. Although most students find the programming part of the course difficult, even those who struggle show enthusiasm and satisfaction as they gain confidence. Newly interested in becoming data scientists, some students go on to take follow-up courses on data structures or algorithms in the computer science department.

Conclusion

The popularity of data science has created both opportunities and challenges for the profession of statistics. While statisticians are openly struggling with how our discipline relates to data science, there is less disagreement among computer scientists, who, rightly or wrongly, distinguish data science from statistics by the heterogeneity and lack of structure of the data that data scientists, as opposed to statisticians, work with.

Since Big Data, which is related to but too frequently conflated with data science, is commonly associated with computer science, computer scientists tend to be open-minded about data science. There’s an old joke that a data scientist is a statistician who lives in San Francisco, but Ph.D. statistician Hadley Wickham suggested on Twitter a more cynical version: a data scientist is a statistician who is useful.
